Project Code

Team Zzz

Author

Aarti Pappu, Divya Bhardwaj, Diego Schummer, and Yasmeen Nahas

Published

June 7, 2023

Abstract
This file contains the code for the project on predicting the quality of white wines based on their objective physicochemical properties, as part of the STAT303-3 course in Spring 2023.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score,train_test_split, KFold, cross_val_predict
from sklearn.metrics import mean_squared_error,r2_score,roc_curve,auc,precision_recall_curve, accuracy_score, \
recall_score, precision_score, confusion_matrix, mean_absolute_error, f1_score, cohen_kappa_score, matthews_corrcoef, classification_report
from sklearn.tree import DecisionTreeRegressor,DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, ParameterGrid, StratifiedKFold, RandomizedSearchCV
from sklearn.ensemble import GradientBoostingRegressor,GradientBoostingClassifier,BaggingRegressor,BaggingClassifier, \
AdaBoostRegressor,AdaBoostClassifier,RandomForestRegressor,RandomForestClassifier
from sklearn.ensemble import VotingClassifier, StackingClassifier
from sklearn.linear_model import LinearRegression,LogisticRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer
from pyearth import Earth
import itertools as it
import time as time
import xgboost as xgb
import re 

1 Data quality check / cleaning / preparation


Our dataset originates from https://archive.ics.uci.edu/ml/datasets/Wine+Quality

To download the dataset, click on “Data Folder” and then “winequality-white.csv”.

# Load in the original dataset for white wine
df = pd.read_csv('winequality-white.csv', sep=';')

1.1 Distribution of response

By Diego Schummer

# Histogram for the distribution of the response (quality)
plt.hist(df['quality'], bins=10)
plt.title('Histogram of Quality Ratings')
plt.xlabel('Quality Rating')
plt.ylabel('Frequency')
plt.show()

# Check the proportion of each class of 'quality' in the whole dataset
print(df['quality'].value_counts(normalize=True))

6    0.448755
5    0.297468
7    0.179665
8    0.035729
4    0.033279
3    0.004083
9    0.001021
Name: quality, dtype: float64

1.2 Data cleaning

By Diego Schummer & Yasmeen Nahas

No data cleaning was necessary: there were no missing values, and the data types of the response and predictors were already usable for our models (see the code below).

# Checking the structure and data types
print(df.info())

# Checking for missing values
print(df.isnull().sum())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898 entries, 0 to 4897
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         4898 non-null   float64
 1   volatile acidity      4898 non-null   float64
 2   citric acid           4898 non-null   float64
 3   residual sugar        4898 non-null   float64
 4   chlorides             4898 non-null   float64
 5   free sulfur dioxide   4898 non-null   float64
 6   total sulfur dioxide  4898 non-null   float64
 7   density               4898 non-null   float64
 8   pH                    4898 non-null   float64
 9   sulphates             4898 non-null   float64
 10  alcohol               4898 non-null   float64
 11  quality               4898 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 459.3 KB
None
fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64

1.3 Data preparation

By Diego Schummer

The following data preparation steps readied our data for the modeling and validation techniques we implemented:

  1. We split our data into training and test sets so that we could tune each model's hyperparameters on the training set and then evaluate the models on the held-out test set. When splitting, we stratified on the response to account for the class imbalance, which ensures the training and test sets have similar distributions of the target variable and yields more reliable performance estimates.

  2. We standardized our dataset to account for the different ranges of the predictors in the dataset.

# Define the features and the target
X = df.drop('quality', axis=1)
y = df['quality']

# Split the data into train and test sets, with stratification on the 'quality' column
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
# Standardize the features
scaler = StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns = X_train.columns, index = X_train.index)
X_test = pd.DataFrame(scaler.transform(X_test), columns = X_test.columns, index = X_test.index)

# Check the distribution of 'quality' in the whole dataset, the training set and the test set
print("Whole dataset:\n", y.value_counts(normalize=True))
print("\nTraining set:\n", y_train.value_counts(normalize=True))
print("\nTest set:\n", y_test.value_counts(normalize=True))
Whole dataset:
 6    0.448755
5    0.297468
7    0.179665
8    0.035729
4    0.033279
3    0.004083
9    0.001021
Name: quality, dtype: float64

Training set:
 6    0.448698
5    0.297601
7    0.179684
8    0.035733
4    0.033180
3    0.004084
9    0.001021
Name: quality, dtype: float64

Test set:
 6    0.448980
5    0.296939
7    0.179592
8    0.035714
4    0.033673
3    0.004082
9    0.001020
Name: quality, dtype: float64

2 Exploratory data analysis

By Yasmeen Nahas


# Visualizing the distribution of the response 
sns.histplot(data=df, x='quality')

# Correlation between predictors, and between predictors and the response
# Plots between predictors and the response helps to identify the possibility of outliers
sns.pairplot(df)

# Step 1: Identify outliers for each predictor
outliers = {}  # Dictionary to store outliers for each predictor

for column in df.columns:
    # Calculate z-scores for each column
    z_scores = np.abs((df[column] - df[column].mean()) / df[column].std())
    # Define a threshold for outlier detection (e.g., z-score > 3)
    threshold = 3
    # Identify outliers based on the threshold
    outliers[column] = df[z_scores > threshold]

# Step 2: Combine outliers from all predictors
combined_outliers = pd.concat(outliers.values()).drop_duplicates()

# Step 3: Determine common outliers
common_outliers = combined_outliers[combined_outliers.duplicated(keep=False)]

# Display the common outliers
print("Common Outliers:")
print(common_outliers)
Common Outliers:
Empty DataFrame
Columns: [fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, quality]
Index: []

There are no common outliers - the outliers are different observations for different predictors.
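As a small self-contained sanity check of this logic (on synthetic data, not our wine dataset; the columns a and b are hypothetical), per-column z-score flags can be intersected across columns to look for common outliers:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for df: two predictors with outliers injected in different rows
toy = pd.DataFrame({'a': rng.normal(size=100), 'b': rng.normal(size=100)})
toy.loc[0, 'a'] = 10.0   # extreme value only in column 'a'
toy.loc[1, 'b'] = -10.0  # extreme value only in column 'b'

threshold = 3
outlier_rows = {}
for column in toy.columns:
    z = np.abs((toy[column] - toy[column].mean()) / toy[column].std())
    outlier_rows[column] = set(toy[z > threshold].index)

# Rows flagged as outliers in every column simultaneously ("common outliers")
common = set.intersection(*outlier_rows.values())
print(outlier_rows, common)
```

Here the intersection comes out empty, mirroring the result above: each injected outlier is extreme in only one predictor.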

# Checking the distributions of the individual predictors
selected_features = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides',
                     'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']

# Calculate the number of rows and columns based on the total number of features
num_features = len(selected_features)
num_rows = int(num_features ** 0.5)
num_cols = (num_features + num_rows - 1) // num_rows

# Create a grid of subplots
fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, 15))

# Iterate over the selected features and plot each density plot on a different subplot
for i, feature in enumerate(selected_features):
    row = i // num_cols
    col = i % num_cols
    ax = axes[row, col] if num_rows > 1 else axes[col]
    
    sns.histplot(df[feature], ax=ax, edgecolor='black')  # distplot is deprecated; histplot is its replacement
    ax.set_title(feature)
    ax.set_xlabel('')
    ax.set_ylabel('Count')
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    

plt.tight_layout()
plt.show()

corr = df.corr()
corr
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
fixed acidity 1.000000 -0.022697 0.289181 0.089021 0.023086 -0.049396 0.091070 0.265331 -0.425858 -0.017143 -0.120881 -0.113663
volatile acidity -0.022697 1.000000 -0.149472 0.064286 0.070512 -0.097012 0.089261 0.027114 -0.031915 -0.035728 0.067718 -0.194723
citric acid 0.289181 -0.149472 1.000000 0.094212 0.114364 0.094077 0.121131 0.149503 -0.163748 0.062331 -0.075729 -0.009209
residual sugar 0.089021 0.064286 0.094212 1.000000 0.088685 0.299098 0.401439 0.838966 -0.194133 -0.026664 -0.450631 -0.097577
chlorides 0.023086 0.070512 0.114364 0.088685 1.000000 0.101392 0.198910 0.257211 -0.090439 0.016763 -0.360189 -0.209934
free sulfur dioxide -0.049396 -0.097012 0.094077 0.299098 0.101392 1.000000 0.615501 0.294210 -0.000618 0.059217 -0.250104 0.008158
total sulfur dioxide 0.091070 0.089261 0.121131 0.401439 0.198910 0.615501 1.000000 0.529881 0.002321 0.134562 -0.448892 -0.174737
density 0.265331 0.027114 0.149503 0.838966 0.257211 0.294210 0.529881 1.000000 -0.093591 0.074493 -0.780138 -0.307123
pH -0.425858 -0.031915 -0.163748 -0.194133 -0.090439 -0.000618 0.002321 -0.093591 1.000000 0.155951 0.121432 0.099427
sulphates -0.017143 -0.035728 0.062331 -0.026664 0.016763 0.059217 0.134562 0.074493 0.155951 1.000000 -0.017433 0.053678
alcohol -0.120881 0.067718 -0.075729 -0.450631 -0.360189 -0.250104 -0.448892 -0.780138 0.121432 -0.017433 1.000000 0.435575
quality -0.113663 -0.194723 -0.009209 -0.097577 -0.209934 0.008158 -0.174737 -0.307123 0.099427 0.053678 0.435575 1.000000
corr > 0.5
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
fixed acidity True False False False False False False False False False False False
volatile acidity False True False False False False False False False False False False
citric acid False False True False False False False False False False False False
residual sugar False False False True False False False True False False False False
chlorides False False False False True False False False False False False False
free sulfur dioxide False False False False False True True False False False False False
total sulfur dioxide False False False False False True True True False False False False
density False False False True False False True True False False False False
pH False False False False False False False False True False False False
sulphates False False False False False False False False False True False False
alcohol False False False False False False False False False False True False
quality False False False False False False False False False False False True

3 Developing the model: Hyperparameter tuning


3.1 Decision Tree

By Aarti Pappu

First, I started with a basic model to better understand the ranges of the different hyperparameters of the decision tree so I could know where to start my tuning.

# Defining the object to build a classification tree
model = DecisionTreeClassifier(random_state=1) 

# Fitting the classification tree to the data
model.fit(X_train, y_train)

print("Maximum number of leaves:", model.get_n_leaves())
print("Maximum depth:", model.get_depth())
print("Maximum features:", len(X_train.columns))
Maximum number of leaves: 1037
Maximum depth: 25
Maximum features: 11

Using the hyperparameter ranges obtained, as well as my intuition about other hyperparameters that could be important in tuning the tree (min_samples_leaf and min_samples_split), I did a coarse grid search to narrow down the optimal hyperparameter ranges. I then plotted its results to better understand where the optimal values lie.

# Coarse grid search parameter grid
param_grid = {    
    'criterion':['gini','entropy'],
    'max_depth': range(2,26,5),
    'max_leaf_nodes': range(2,1038,100),
    'max_features': range(1, 12,3),
    'min_samples_leaf': range(1,10,2),
    'min_samples_split': range(2,10,2)
}

# Using 2-fold CV because of the limited number of instances of some of the classes 
skf = StratifiedKFold(n_splits=2)

# Aiming to maximize F1-score
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid, scoring=['f1_weighted','accuracy'], refit= 'f1_weighted', cv=skf, n_jobs=-1, verbose = True)
grid_search.fit(X_train, y_train)

# Make the predictions
y_pred = grid_search.predict(X_test)

# Note: a classifier's .score() returns accuracy, not F1
print('Train accuracy : %.3f'%grid_search.best_estimator_.score(X_train, y_train))
print('Test accuracy : %.3f'%grid_search.best_estimator_.score(X_test, y_test))
print('Best F1-score Through Grid Search : %.3f'%grid_search.best_score_)

print('Best params for F1-score')
print(grid_search.best_params_)
Fitting 2 folds for each of 8800 candidates, totalling 17600 fits
Train accuracy : 0.882
Test accuracy : 0.571
Best F1-score Through Grid Search : 0.540
Best params for F1-score
{'criterion': 'entropy', 'max_depth': 22, 'max_features': 4, 'max_leaf_nodes': 702, 'min_samples_leaf': 1, 'min_samples_split': 2}
cv_results = pd.DataFrame(grid_search.cv_results_)
fig, axes = plt.subplots(3,2,figsize=(18,20))
plt.subplots_adjust(wspace=0.2)
axes[0,0].plot(cv_results.param_max_depth, cv_results.mean_test_f1_weighted, 'o')
axes[0,0].set_ylim([0,1])
axes[0,0].set_xlabel('max_depth')
axes[0,0].set_ylabel('K-fold F1 Score')
axes[0,1].plot(cv_results.param_max_leaf_nodes, cv_results.mean_test_f1_weighted, 'o')
axes[0,1].set_ylim([0,1])
axes[0,1].set_xlabel('max_leaf_nodes')
axes[0,1].set_ylabel('K-fold F1 Score')
axes[1,0].plot(cv_results.param_max_features, cv_results.mean_test_f1_weighted, 'o')
axes[1,0].set_ylim([0,1])
axes[1,0].set_xlabel('max_features')
axes[1,0].set_ylabel('K-fold F1 Score')
axes[1,1].plot(cv_results.param_min_samples_leaf, cv_results.mean_test_f1_weighted, 'o')
axes[1,1].set_ylim([0,1])
axes[1,1].set_xlabel('min_samples_leaf')
axes[1,1].set_ylabel('K-fold F1 Score')
axes[2,0].plot(cv_results.param_min_samples_split, cv_results.mean_test_f1_weighted, 'o')
axes[2,0].set_ylim([0,1])
axes[2,0].set_xlabel('min_samples_split')
axes[2,0].set_ylabel('K-fold F1 Score')
axes[2,1].plot(cv_results.param_criterion, cv_results.mean_test_f1_weighted, 'o')
axes[2,1].set_ylim([0,1])
axes[2,1].set_xlabel('criterion')
axes[2,1].set_ylabel('K-fold F1 Score');

Since it was clear that the optimal values of min_samples_split and min_samples_leaf were their defaults (2 and 1, respectively), I did not tune them further. However, I was uncertain about the optimal values of the remaining hyperparameters, so I did a finer grid search, narrowing the range of each hyperparameter based on the graphs above.
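As an aside, the narrowing step itself can be sketched programmatically (shown here on a toy dataset from make_classification, not our wine data): filter cv_results_ for parameter values whose mean CV score sits close to the best, and use those as candidates for the next grid. The 0.02 tolerance is an arbitrary choice for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Toy problem standing in for the wine data
X_toy, y_toy = make_classification(n_samples=300, random_state=1)

search = GridSearchCV(DecisionTreeClassifier(random_state=1),
                      {'max_depth': range(2, 11, 2)},
                      scoring='f1_weighted', cv=2).fit(X_toy, y_toy)

res = pd.DataFrame(search.cv_results_)
# Keep the parameter values whose mean CV score is within 0.02 of the best
keep = res.loc[res.mean_test_score >= res.mean_test_score.max() - 0.02,
               'param_max_depth'].tolist()
print(sorted(keep))  # candidate values for the finer grid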

# Finer grid search
param_grid = {    
    'criterion':['gini','entropy'],
    'max_depth': range(8,26,2),
    'max_leaf_nodes': range(100,1038,50),
    'max_features': range(1,12,2)
}

# Using 2-fold CV because of the limited number of instances of some of the classes 
skf = StratifiedKFold(n_splits=2)

# Aiming to maximize F1-score
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid, scoring=['f1_weighted','accuracy'], refit= 'f1_weighted', cv=skf, n_jobs=-1, verbose = True)
grid_search.fit(X_train, y_train)

# Make the predictions
y_pred = grid_search.predict(X_test)

# Note: a classifier's .score() returns accuracy, not F1
print('Train accuracy : %.3f'%grid_search.best_estimator_.score(X_train, y_train))
print('Test accuracy : %.3f'%grid_search.best_estimator_.score(X_test, y_test))
print('Best F1-score Through Grid Search : %.3f'%grid_search.best_score_)

print('Best params for F1-score')
print(grid_search.best_params_)
Fitting 2 folds for each of 2052 candidates, totalling 4104 fits
Train accuracy : 0.857
Test accuracy : 0.545
Best F1-score Through Grid Search : 0.540
Best params for F1-score
{'criterion': 'gini', 'max_depth': 18, 'max_features': 1, 'max_leaf_nodes': 850}
cv_results = pd.DataFrame(grid_search.cv_results_)
fig, axes = plt.subplots(2,2,figsize=(18,15))
plt.subplots_adjust(wspace=0.2)
axes[0,0].plot(cv_results.param_max_depth, cv_results.mean_test_f1_weighted, 'o')
axes[0,0].set_ylim([0.4,0.6])
axes[0,0].set_xlabel('max_depth')
axes[0,0].set_ylabel('K-fold F1 Score')
axes[0,1].plot(cv_results.param_max_leaf_nodes, cv_results.mean_test_f1_weighted, 'o')
axes[0,1].set_ylim([0.4,0.6])
axes[0,1].set_xlabel('max_leaf_nodes')
axes[0,1].set_ylabel('K-fold F1 Score')
axes[1,0].plot(cv_results.param_max_features, cv_results.mean_test_f1_weighted, 'o')
axes[1,0].set_ylim([0.4,0.6])
axes[1,0].set_xlabel('max_features')
axes[1,0].set_ylabel('K-fold F1 Score')
axes[1,1].plot(cv_results.param_criterion, cv_results.mean_test_f1_weighted, 'o')
axes[1,1].set_ylim([0.4,0.6])
axes[1,1].set_xlabel('criterion')
axes[1,1].set_ylabel('K-fold F1 Score');

Since it seemed clear that gini was the optimal criterion, I decided to further tune only the remaining hyperparameters (max_depth, max_leaf_nodes, and max_features) to find their optimal values.

# Finer grid search
param_grid = { 
    'max_depth': range(16,26),
    'max_leaf_nodes': range(550,1038),
    'max_features': range(1,5)
}

# Using 2-fold CV because of the limited number of instances of some of the classes 
skf = StratifiedKFold(n_splits=2)

# Aiming to maximize F1-score
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid, scoring=['f1_weighted','accuracy'], refit= 'f1_weighted', cv=skf, n_jobs=-1, verbose = True)
grid_search.fit(X_train, y_train)

# Make the predictions
y_pred = grid_search.predict(X_test)

# Note: a classifier's .score() returns accuracy, not F1
print('Train accuracy : %.3f'%grid_search.best_estimator_.score(X_train, y_train))
print('Test accuracy : %.3f'%grid_search.best_estimator_.score(X_test, y_test))
print('Best F1-score Through Grid Search : %.3f'%grid_search.best_score_)

print('Best params for F1-score')
print(grid_search.best_params_)
Fitting 2 folds for each of 19520 candidates, totalling 39040 fits
Train accuracy : 0.864
Test accuracy : 0.559
Best F1-score Through Grid Search : 0.540
Best params for F1-score
{'max_depth': 17, 'max_features': 4, 'max_leaf_nodes': 613}

Based on the optimal hyperparameters found through the grid searches, I fit the tuned model:

# Decision tree classifier with the optimal parameters
dt_model = DecisionTreeClassifier(random_state=1, max_depth=17, max_features = 4, max_leaf_nodes=613)
dt_model.fit(X_train,y_train)

# Getting the feature importances of the model
feature_importance_df = pd.concat([pd.Series(dt_model.feature_names_in_), pd.Series(dt_model.feature_importances_)], axis = 1)
feature_importance_df.rename(columns={0: "predictors", 1: "feature_importance"}).sort_values(by='feature_importance', ascending=False)
predictors feature_importance
7 density 0.137988
1 volatile acidity 0.109337
10 alcohol 0.104010
3 residual sugar 0.098156
8 pH 0.094459
0 fixed acidity 0.086861
2 citric acid 0.080260
9 sulphates 0.075553
5 free sulfur dioxide 0.075066
6 total sulfur dioxide 0.073783
4 chlorides 0.064526
# Predict the labels of the test set using the optimal model
y_pred = dt_model.predict(X_test)

# Print the classification report
print(classification_report(y_test, y_pred))

# Print the confusion matrix
print(confusion_matrix(y_test, y_pred))
              precision    recall  f1-score   support

           3       0.00      0.00      0.00         4
           4       0.21      0.21      0.21        33
           5       0.59      0.61      0.60       291
           6       0.58      0.60      0.59       440
           7       0.52      0.49      0.51       176
           8       0.46      0.34      0.39        35
           9       0.00      0.00      0.00         1

    accuracy                           0.56       980
   macro avg       0.34      0.32      0.33       980
weighted avg       0.56      0.56      0.56       980

[[  0   0   1   3   0   0   0]
 [  0   7  15  10   1   0   0]
 [  1  12 177  90  10   1   0]
 [  1  10  92 265  64   8   0]
 [  0   1   9  74  87   5   0]
 [  0   3   4  11   5  12   0]
 [  0   0   0   0   1   0   0]]
# Compare to the base model
dt_base = DecisionTreeClassifier(random_state=1, criterion = 'entropy').fit(X_train, y_train)
y_pred = dt_base.predict(X_test)

# Print the classification report
print(classification_report(y_test, y_pred))

# Print the confusion matrix
print(confusion_matrix(y_test, y_pred))
              precision    recall  f1-score   support

           3       0.00      0.00      0.00         4
           4       0.30      0.30      0.30        33
           5       0.65      0.64      0.64       291
           6       0.67      0.67      0.67       440
           7       0.57      0.57      0.57       176
           8       0.43      0.57      0.49        35
           9       0.00      0.00      0.00         1

    accuracy                           0.62       980
   macro avg       0.38      0.39      0.38       980
weighted avg       0.62      0.62      0.62       980

[[  0   1   0   2   0   1   0]
 [  0  10  13   7   3   0   0]
 [  1  10 185  82   9   4   0]
 [  0   9  73 294  56   8   0]
 [  0   3  11  49 100  13   0]
 [  0   0   1   7   7  20   0]
 [  0   0   0   0   1   0   0]]

Despite the tuning, the basic model, dt_base, has a better balance of precision, recall, and F1-score for most classes, as well as the highest overall accuracy.

# Final model
dt_model = DecisionTreeClassifier(random_state=1, criterion = 'entropy').fit(X_train, y_train)

3.2 Bagged Decision Trees

By Divya Bhardwaj

#Creating a baseline bagging classifier model

initialmodel = BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=1)).fit(X_train, y_train)

#Getting the accuracy and classification report for this model
y_pred = initialmodel.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))

print(classification_report(y_test, y_pred))
Accuracy: 0.6408163265306123
              precision    recall  f1-score   support

           3       0.00      0.00      0.00         4
           4       0.56      0.30      0.39        33
           5       0.67      0.67      0.67       291
           6       0.63      0.72      0.67       440
           7       0.58      0.49      0.54       176
           8       0.86      0.54      0.67        35
           9       0.00      0.00      0.00         1

    accuracy                           0.64       980
   macro avg       0.47      0.39      0.42       980
weighted avg       0.64      0.64      0.64       980

I chose to optimize the weighted F1 score, which is usually a better target than accuracy for uneven class distributions because it balances precision and recall within each class.
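A quick illustration of this point, using hypothetical counts rather than our data: a classifier that always predicts the majority class earns a high accuracy, but its weighted F1 is lower because the minority class contributes an F1 of zero.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced labels: 90 of class 6, 10 of class 5
y_true = [6] * 90 + [5] * 10
y_majority = [6] * 100  # always predict the majority class

acc = accuracy_score(y_true, y_majority)
f1w = f1_score(y_true, y_majority, average='weighted')
print(acc, f1w)  # accuracy 0.9; weighted F1 is lower (~0.85)
```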

I tuned max_samples, max_features, bootstrap, and bootstrap_features. For the number of trees, I used a fixed value during the grid searches, then fine-tuned it at the end once I had the optimal values for the other hyperparameters.

params = {'base_estimator': [DecisionTreeClassifier(random_state = 1)],
          'n_estimators': [300],
          'max_samples': [0.5, 0.75, 1.0],
          'max_features': [0.5, 0.75, 1.0],
          'bootstrap': [True, False],
          'bootstrap_features': [True, False]}

cv = KFold(n_splits=5,shuffle=True,random_state=1)
bagging_classifier_grid = GridSearchCV(BaggingClassifier(random_state=1, n_jobs=-1), 
                                      param_grid =params, cv=cv, n_jobs=-1, verbose=1,
                                      scoring = ['f1_weighted'], refit='f1_weighted')
bagging_classifier_grid.fit(X_train, y_train)

print('Best Parameters : ',bagging_classifier_grid.best_params_)
Fitting 5 folds for each of 36 candidates, totalling 180 fits
Best Parameters :  {'base_estimator': DecisionTreeClassifier(random_state=1), 'bootstrap': False, 'bootstrap_features': True, 'max_features': 1.0, 'max_samples': 0.75, 'n_estimators': 300}

Testing the model with the optimal hyperparameters on X_test:

model = BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=1), n_estimators=300, 
                          random_state=1,max_features=1.0, max_samples=0.75, bootstrap=False,bootstrap_features=True).fit(X_train, y_train)
y_pred = model.predict(X_test)
print("F1 Score:", f1_score(y_test, y_pred, average='weighted')) #accounting for the classes 
print(classification_report(y_test, y_pred))
F1 Score: 0.6419354491362463
              precision    recall  f1-score   support

           3       0.00      0.00      0.00         4
           4       0.62      0.24      0.35        33
           5       0.70      0.63      0.66       291
           6       0.62      0.79      0.69       440
           7       0.67      0.45      0.54       176
           8       0.95      0.57      0.71        35
           9       0.00      0.00      0.00         1

    accuracy                           0.65       980
   macro avg       0.51      0.38      0.42       980
weighted avg       0.66      0.65      0.64       980
# Plotting max_samples and max_features vs. K-fold F1 score
cv_results = pd.DataFrame(bagging_classifier_grid.cv_results_)
fig, axes = plt.subplots(1,2, figsize=(10,5))
plt.subplots_adjust(wspace=0.2)
axes[0].plot(cv_results.param_max_samples, cv_results.mean_test_f1_weighted, 'o')
axes[0].set_xlabel('max_samples')
axes[0].set_ylabel('K-fold F1 Score')
axes[1].plot(cv_results.param_max_features, cv_results.mean_test_f1_weighted, 'o')
axes[1].set_xlabel('max_features')
axes[1].set_ylabel('K-fold F1 Score')

Based on these results, I refined the grid for another search:

params = {'base_estimator': [DecisionTreeClassifier(random_state = 1)],
          'n_estimators': [300],
          'max_samples': [0.65, 0.75, 0.85],
          'max_features': [0.65, 0.75, 0.85],
          'bootstrap': [True, False],
          'bootstrap_features': [True, False]}

cv = KFold(n_splits=5,shuffle=True,random_state=1)
bagging_classifier_grid = GridSearchCV(BaggingClassifier(random_state=1, n_jobs=-1), 
                                      param_grid =params, cv=cv, n_jobs=-1, verbose=1,
                                      scoring = ['f1_weighted'], refit='f1_weighted')
bagging_classifier_grid.fit(X_train, y_train)

print('Best Parameters : ',bagging_classifier_grid.best_params_)
Fitting 5 folds for each of 36 candidates, totalling 180 fits
Best Parameters :  {'base_estimator': DecisionTreeClassifier(random_state=1), 'bootstrap': False, 'bootstrap_features': False, 'max_features': 0.65, 'max_samples': 0.65, 'n_estimators': 300}

Testing the model with the optimal hyperparameters on X_test:

model = BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=1), n_estimators=300, 
                          random_state=1,max_features=0.65, max_samples=0.65, bootstrap=False,bootstrap_features=False).fit(X_train, y_train)
y_pred = model.predict(X_test)
print("F1 Score:", f1_score(y_test, y_pred, average='weighted')) #accounting for the classes 
print(classification_report(y_test, y_pred))
F1 Score: 0.6479616424675945
              precision    recall  f1-score   support

           3       0.00      0.00      0.00         4
           4       0.73      0.24      0.36        33
           5       0.71      0.63      0.67       291
           6       0.62      0.80      0.70       440
           7       0.67      0.47      0.56       176
           8       1.00      0.51      0.68        35
           9       0.00      0.00      0.00         1

    accuracy                           0.66       980
   macro avg       0.53      0.38      0.42       980
weighted avg       0.67      0.66      0.65       980
# Plotting max_samples and max_features vs. K-fold F1 score
cv_results = pd.DataFrame(bagging_classifier_grid.cv_results_)
fig, axes = plt.subplots(1,2, figsize=(10,5))
plt.subplots_adjust(wspace=0.2)
axes[0].plot(cv_results.param_max_samples, cv_results.mean_test_f1_weighted, 'o')
axes[0].set_xlabel('max_samples')
axes[0].set_ylabel('K-fold F1 Score')
axes[1].plot(cv_results.param_max_features, cv_results.mean_test_f1_weighted, 'o')
axes[1].set_xlabel('max_features')
axes[1].set_ylabel('K-fold F1 Score')

Based on these results, I refined the grid for another search:

params = {'base_estimator': [DecisionTreeClassifier(random_state = 1)],
          'n_estimators': [300],
          'max_samples': [0.6, 0.65, 0.7],
          'max_features': [0.6, 0.65, 0.7],
          'bootstrap': [True, False],
          'bootstrap_features': [True, False]}

cv = KFold(n_splits=5,shuffle=True,random_state=1)
bagging_classifier_grid = GridSearchCV(BaggingClassifier(random_state=1, n_jobs=-1), 
                                      param_grid =params, cv=cv, n_jobs=-1, verbose=1,
                                      scoring = ['f1_weighted'], refit='f1_weighted')
bagging_classifier_grid.fit(X_train, y_train)

print('Best Parameters : ',bagging_classifier_grid.best_params_)
Fitting 5 folds for each of 36 candidates, totalling 180 fits
Best Parameters :  {'base_estimator': DecisionTreeClassifier(random_state=1), 'bootstrap': False, 'bootstrap_features': False, 'max_features': 0.65, 'max_samples': 0.65, 'n_estimators': 300}

This gave the same model as before, so I switched from proportions to absolute values to tune more finely. For max_features, the optimal values from the previous grid searches all corresponded to 7 features, so I fixed that value.
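For reference, here is the arithmetic behind that switch (assuming, as I believe scikit-learn does, that float-valued max_features / max_samples are truncated via int(frac * n); we have 11 predictors, and the 80% training split of the 4898 rows has 3918 rows):

```python
# How the fractional hyperparameters translate into absolute counts
n_features, n_train = 11, 3918

for frac in (0.6, 0.65, 0.7):
    # assumed scikit-learn behavior: int(frac * n) features / samples per estimator
    print(frac, int(frac * n_features), int(frac * n_train))
```

With frac = 0.65 this gives 7 features and roughly 2546 samples, which motivates fixing max_features at 7 and searching max_samples around 2400-2600.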

params = {'base_estimator': [DecisionTreeClassifier(random_state = 1)],
          'n_estimators': [300],
          'max_samples': [2400, 2500, 2600],
          'max_features': [7],
          'bootstrap': [True, False],
          'bootstrap_features': [True, False]}

cv = KFold(n_splits=5,shuffle=True,random_state=1)
bagging_classifier_grid = GridSearchCV(BaggingClassifier(random_state=1, n_jobs=-1), 
                                      param_grid =params, cv=cv, n_jobs=-1, verbose=1,
                                      scoring = ['f1_weighted'], refit='f1_weighted')
bagging_classifier_grid.fit(X_train, y_train)

print('Best Parameters : ',bagging_classifier_grid.best_params_)
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Best Parameters :  {'base_estimator': DecisionTreeClassifier(random_state=1), 'bootstrap': False, 'bootstrap_features': False, 'max_features': 7, 'max_samples': 2500, 'n_estimators': 300}
# Plotting max_samples vs. K-fold weighted F1 score
cv_results = pd.DataFrame(bagging_classifier_grid.cv_results_)
plt.plot(cv_results.param_max_samples, cv_results.mean_test_f1_weighted, 'o')
plt.xlabel('max_samples')
plt.ylabel('K-fold F1 Score')
Text(0, 0.5, 'K-fold F1 Score')

Testing the optimal model on X_test, increasing the number of trees manually to find the optimal value since the rest of the hyperparameters are now fixed:

model = BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=1), n_estimators=650, 
                          random_state=1,max_features= 7, max_samples=2500, bootstrap=False,bootstrap_features=False).fit(X_train, y_train)
y_pred = model.predict(X_test)
print("F1 Score:", f1_score(y_test, y_pred, average='weighted')) # weighted average accounts for class imbalance
print(classification_report(y_test, y_pred))
F1 Score: 0.6505479910297426
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         4
           1       0.64      0.21      0.32        33
           2       0.70      0.64      0.67       291
           3       0.62      0.80      0.70       440
           4       0.69      0.47      0.56       176
           5       1.00      0.54      0.70        35
           6       0.00      0.00      0.00         1

    accuracy                           0.66       980
   macro avg       0.52      0.38      0.42       980
weighted avg       0.67      0.66      0.65       980
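Throughout, the weighted F1 score is the support-weighted mean of the per-class F1 scores, which is why the rare classes (supports 4 and 1) barely move it. A minimal sketch on toy labels (not the wine data) confirming this definition against scikit-learn:

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy labels with three imbalanced classes
y_true = np.array([0, 0, 0, 0, 1, 1, 2])
y_hat  = np.array([0, 0, 1, 0, 1, 2, 2])

per_class_f1 = f1_score(y_true, y_hat, average=None)  # one F1 per class
supports = np.bincount(y_true)                        # class frequencies

# Weighted F1 = sum over classes of (support / n) * per-class F1
manual = np.sum(per_class_f1 * supports) / supports.sum()
print(np.isclose(manual, f1_score(y_true, y_hat, average='weighted')))  # True
```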

This is the best bagging model overall in terms of weighted F1 score; its accuracy and precision/recall balance are also strong for most classes. Looking at its feature importances:

# Average feature importances across base estimators, mapping each
# estimator's feature subset back to the original predictor indices
total = [0] * 11
for i, temp_model in enumerate(model.estimators_):
    feature = model.estimators_features_[i]  # indices of predictors used by this estimator
    for j, predictor in enumerate(feature):
        total[predictor] += temp_model.feature_importances_[j]
feature_importances = np.array(total)/len(model.estimators_)
importance_df = pd.concat([pd.Series(model.feature_names_in_, name = 'predictor'),
                           pd.Series(feature_importances, name = 'importance')], axis = 1)
importance_df.sort_values(by = 'importance', ascending = False)
predictor importance
10 alcohol 0.114615
7 density 0.108228
6 total sulfur dioxide 0.093421
8 pH 0.092443
5 free sulfur dioxide 0.091525
1 volatile acidity 0.091202
3 residual sugar 0.090663
4 chlorides 0.084718
9 sulphates 0.080082
2 citric acid 0.079759
0 fixed acidity 0.073344

For the stacking ensemble, I converted max_samples from a count to a proportion of the training set so that the same setting could be reused there.

2500/(X_train.shape[0])
0.6380806533945891
# Final model
bg_model = BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=1), n_estimators=650, 
                          random_state=1,max_features= 7, max_samples=0.6380806533945891, bootstrap=False,bootstrap_features=False).fit(X_train, y_train)

3.3 Random Forest

By Diego Schummer

# Create a Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model using the training sets
rf.fit(X_train, y_train.values.ravel())

# Predict the response for the test dataset
y_pred = rf.predict(X_test)

# Model Accuracy, how often is the classifier correct?
print("Accuracy:", accuracy_score(y_test, y_pred))

# Detailed report of classification done by the model
print(classification_report(y_test, y_pred))
Accuracy: 0.6755102040816326
              precision    recall  f1-score   support

           3       0.00      0.00      0.00         4
           4       0.70      0.21      0.33        33
           5       0.71      0.66      0.68       291
           6       0.64      0.82      0.72       440
           7       0.71      0.48      0.58       176
           8       0.95      0.54      0.69        35
           9       0.00      0.00      0.00         1

    accuracy                           0.68       980
   macro avg       0.53      0.39      0.43       980
weighted avg       0.68      0.68      0.66       980

In the parameter grid, I’ve added ‘class_weight’ to handle the class imbalance. It adjusts the cost function to give more weight to under-represented classes and less weight to over-represented ones. ‘balanced’ and ‘balanced_subsample’ automatically adjust weights inversely proportional to class frequencies. The ‘None’ option is added to compare the performance with and without class weighting.
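The ‘balanced’ heuristic can be computed by hand: each class weight is n_samples / (n_classes * class_count). A quick sketch on toy labels (not the wine data), checked against scikit-learn’s compute_class_weight:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy imbalanced labels: class counts 6, 3, 1
y = np.array([0]*6 + [1]*3 + [2])

# 'balanced' weight per class = n_samples / (n_classes * class count)
manual = len(y) / (len(np.unique(y)) * np.bincount(y))
sklearn_w = compute_class_weight('balanced', classes=np.unique(y), y=y)
print(np.allclose(manual, sklearn_w))  # True
print(manual.argmax())                 # 2 -- the rarest class gets the largest weight
```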

### Tuning

# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 10, 20, 30, 40],
    'min_samples_split': [2, 10, 20],
    'min_samples_leaf': [1, 5, 10],
    'class_weight' : ['balanced', 'balanced_subsample', None]
}

# Create a GridSearchCV object
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=2, scoring='accuracy')

# Fit the GridSearch to our training data
grid_search.fit(X_train, y_train.values.ravel())

print("Best parameters: ", grid_search.best_params_)
print("Best score: ", grid_search.best_score_)

# Create a new Random Forest classifier with the best parameters
rf_best = RandomForestClassifier(**grid_search.best_params_, random_state=42)

# Fit the model on the training data
rf_best.fit(X_train, y_train.values.ravel())

# Predict the labels of the test set
y_pred = rf_best.predict(X_test)

# Print the classification report
print(classification_report(y_test, y_pred))

# Print the confusion matrix
print(confusion_matrix(y_test, y_pred))
Best parameters:  {'class_weight': None, 'criterion': 'entropy', 'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 300}
Best score:  0.6347626339969372
              precision    recall  f1-score   support

           3       0.00      0.00      0.00         4
           4       0.67      0.24      0.36        33
           5       0.71      0.65      0.68       291
           6       0.63      0.79      0.70       440
           7       0.67      0.49      0.57       176
           8       1.00      0.54      0.70        35
           9       0.00      0.00      0.00         1

    accuracy                           0.66       980
   macro avg       0.53      0.39      0.43       980
weighted avg       0.67      0.66      0.66       980

[[  0   0   1   3   0   0   0]
 [  0   8  16   9   0   0   0]
 [  0   3 189  99   0   0   0]
 [  0   1  58 349  32   0   0]
 [  0   0   3  87  86   0   0]
 [  0   0   0   7   9  19   0]
 [  0   0   0   0   1   0   0]]
#Feature Importance
feature_importances = pd.DataFrame(rf_best.feature_importances_,
                                      index = X_train.columns,                          
                                        columns=['importance']).sort_values('importance', ascending=False)
print(feature_importances)
  • Class 6: with precision 0.63 and recall 0.79, this class is the best predicted by the model.
  • The accuracy of the model is 0.66, i.e., the proportion of correctly classified observations.
  • The model performs decently for classes 5, 6, and 7 but poorly for the rare classes.
# Create a new Random Forest classifier with the best parameters
rf_best = RandomForestClassifier(n_estimators=100, criterion='entropy', max_depth=20, min_samples_leaf=1,
                                 min_samples_split=2, class_weight=None, random_state=42)

# Fit the model on the training data
rf_best.fit(X_train, y_train.values.ravel())

# Predict the labels of the test set
y_pred = rf_best.predict(X_test)

# Print the classification report
print(classification_report(y_test, y_pred))

# Print the confusion matrix
print(confusion_matrix(y_test, y_pred))
              precision    recall  f1-score   support

           3       0.00      0.00      0.00         4
           4       0.64      0.21      0.32        33
           5       0.70      0.66      0.68       291
           6       0.64      0.80      0.71       440
           7       0.72      0.49      0.59       176
           8       1.00      0.57      0.73        35
           9       0.00      0.00      0.00         1

    accuracy                           0.67       980
   macro avg       0.53      0.39      0.43       980
weighted avg       0.68      0.67      0.66       980

[[  0   0   1   3   0   0   0]
 [  0   7  18   8   0   0   0]
 [  0   3 191  96   1   0   0]
 [  0   1  60 354  25   0   0]
 [  0   0   2  87  87   0   0]
 [  0   0   1   8   6  20   0]
 [  0   0   0   0   1   0   0]]
# compare to base model

#create base model
rf_base = RandomForestClassifier(n_estimators=100, random_state=42)

#fit base model
rf_base.fit(X_train, y_train.values.ravel())

#predict base model
y_pred = rf_base.predict(X_test)

# Print the classification report
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           3       0.00      0.00      0.00         4
           4       0.70      0.21      0.33        33
           5       0.71      0.66      0.68       291
           6       0.64      0.82      0.72       440
           7       0.71      0.48      0.58       176
           8       0.95      0.54      0.69        35
           9       0.00      0.00      0.00         1

    accuracy                           0.68       980
   macro avg       0.53      0.39      0.43       980
weighted avg       0.68      0.68      0.66       980
best_parameters = {'class_weight': None, 'criterion': 'entropy', 'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}

# define model with chosen parameters
rf_model = RandomForestClassifier(**best_parameters, random_state=42)

#fit model
rf_model.fit(X_train, y_train.values.ravel())

#predict model
y_pred = rf_model.predict(X_test)

# Print the classification report
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           3       0.00      0.00      0.00         4
           4       0.67      0.24      0.36        33
           5       0.69      0.65      0.67       291
           6       0.62      0.78      0.69       440
           7       0.67      0.47      0.55       176
           8       1.00      0.54      0.70        35
           9       0.00      0.00      0.00         1

    accuracy                           0.66       980
   macro avg       0.52      0.38      0.42       980
weighted avg       0.66      0.66      0.65       980

The base model, rf_base, has a better balance of precision, recall, and F1-score for most classes, as well as the highest overall accuracy.

print("RandomForestClassifier(n_estimators=100, criterion='entropy', max_depth=None, min_samples_leaf=1, min_samples_split=2, class_weight=None, random_state=42)")
RandomForestClassifier(n_estimators=100, criterion='entropy', max_depth=None, min_samples_leaf=1, min_samples_split=2, class_weight=None, random_state=42)

3.4 XGBoost

By Yasmeen Nahas

In order to take class imbalance into account, I will be implementing an algorithm that balances the per-class sample weights in the XGBoost model.

First, the class labels need to be re-encoded as consecutive integers starting at 0, since XGBoost’s multiclass objective does not accept the original 3–9 quality scale.

# mapping the quality labels 3-9 to 0-6 (XGBoost requires labels starting at 0)

num = {3: 0,
       4: 1,
       5: 2,
       6: 3,
       7: 4,       
       8: 5,      
       9: 6}            

y_train = y_train.map(num)
y_test = y_test.map(num)
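The same re-encoding could also be done with scikit-learn’s LabelEncoder, which assigns consecutive integers 0..k-1 in sorted label order and provides inverse_transform for mapping predictions back to the original 3–9 quality scale. A sketch on hypothetical labels:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

quality = np.array([5, 6, 7, 3, 9, 8, 4, 6])  # hypothetical quality labels

le = LabelEncoder()
encoded = le.fit_transform(quality)   # sorted unique labels 3..9 -> 0..6
print(encoded)                        # [2 3 4 0 6 5 1 3]
print(le.inverse_transform([0, 6]))   # [3 9]
```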
# taking weights into account using an algorithm that rebalances the weights

def CreateBalancedSampleWeights(y_train, largest_class_weight_coef):
    # unique class labels, sorted
    classes = np.unique(y_train, axis = 0)
    classes.sort()
    # number of observations per class
    class_samples = np.bincount(y_train)
    total_samples = class_samples.sum()
    n_classes = len(class_samples)
    # weight per class: total / (n_classes * count), so rarer classes get larger weights
    weights = total_samples / (n_classes * class_samples * 1.0)
    # map each class label to its weight
    class_weight_dict = {key : value for (key, value) in zip(classes, weights)}
    # rescale the weight of the second class by the largest-class proportion
    class_weight_dict[classes[1]] = class_weight_dict[classes[1]] * largest_class_weight_coef
    # per-observation weights, looked up from the class weights
    sample_weights = [class_weight_dict[y] for y in y_train]
    return sample_weights
# Stepping through the function on the training labels:
classes = np.unique(y_train, axis = 0)
classes.sort()
# number of observations per class
class_samples = np.bincount(y_train)
total_samples = class_samples.sum()
n_classes = len(class_samples)
# weight per class: total / (n_classes * count), so rarer classes get larger weights
weights = total_samples / (n_classes * class_samples * 1.0)
class_weight_dict = {key : value for (key, value) in zip(classes, weights)}
class_weight_dict
{0: 34.982142857142854,
 1: 4.305494505494505,
 2: 0.48002940455770643,
 3: 0.31838127742564604,
 4: 0.7950487012987013,
 5: 3.9979591836734696,
 6: 139.92857142857142}
# identifying the proportion of observations in the largest class
largest_class_weight_coef = max(y_train.value_counts().values)/y_train.shape[0]
    
#pass y_train as numpy array
weight = CreateBalancedSampleWeights(y_train, largest_class_weight_coef)
class_weight_dict[classes[1]] = class_weight_dict[classes[1]] * largest_class_weight_coef
sample_weights = [class_weight_dict[y] for y in y_train]
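A useful sanity check on the weighting formula above: since each per-sample weight is total / (n_classes * count), every class’s total contribution (count × weight) comes out to the same constant, so no class dominates the loss. A minimal sketch on toy labels (ignoring the largest_class_weight_coef rescaling in the function):

```python
import numpy as np

y = np.array([0, 1, 1, 1, 1, 1, 2, 2])  # toy labels: class 0 is rare

counts = np.bincount(y)                    # [1, 5, 2]
weights = len(y) / (len(counts) * counts)  # per-sample weight for each class

# Rarest class receives the largest per-sample weight
print(weights.argmax())  # 0
# Each class's total weight is the same constant, n / n_classes
print(np.allclose(counts * weights, len(y) / len(counts)))  # True
```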

The algorithm adjusts the weight of each observation accordingly. Next, I use a base model as a starting point to tune my model:

# Create XGBoost classifier with the custom evaluation metric
model = xgb.XGBClassifier(objective='multi:softmax', 
                          num_class=7,
                          eval_metric=f1_score,
                          seed=45) 

model.fit(X_train, y_train, sample_weight = weight)
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False,
              eval_metric=<function f1_score at 0x7fc20a768820>,
              feature_types=None, gamma=None, gpu_id=None, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=None, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, n_estimators=100, n_jobs=None,
              num_class=7, num_parallel_tree=None, objective='multi:softmax', ...)
pd.concat([pd.Series(X_train.columns, name = 'predictor'), 
           pd.Series(model.feature_importances_, 
                     name = 'importance')], axis = 1).sort_values(by = 'importance', ascending=False)
model.score(X_test, y_test)
0.6275510204081632

The base model’s test accuracy is 62.8%. Next, I tune the hyperparameters.

multi:softmax is used as the objective since we are looking at multiple classes, and I specified num_class (7). For cross-validation, scoring is set to ‘f1_weighted’ in cross_val_score. I examined each hyperparameter individually to find its optimal range:

3.4.1 1. Number of Trees

# Coarse grid search
def get_models():
    models = dict()
    # define number of trees to consider
    n_trees = [5, 10, 50, 100, 500, 1000, 2000, 5000]
    for n in n_trees:
        models[str(n)] = xgb.XGBClassifier(objective='multi:softmax', 
                          num_class=7,
                          eval_metric=f1_score,
                          n_estimators=n,
                          random_state = 1)
    return models

# evaluate a given model using cross-validation
def evaluate_model(model, X, y, sample_weights):
    # define the evaluation procedure
    cv = KFold(n_splits=5, shuffle=True, random_state=1)
    fit_params = {"sample_weight": sample_weights}
    # evaluate the model and collect the results
    scores = (cross_val_score(model, X, y, fit_params = fit_params, scoring= 'f1_weighted', cv=cv, n_jobs=-1))
    return scores

# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
    # evaluate the model
    scores = evaluate_model(model, X_train, y_train, weight)
    # store the results
    results.append(scores)
    names.append(name)
    # summarize the performance along the way
    print('>%s %.3f (%.3f)' % (name, np.mean(scores), np.std(scores)))
# plot model performance for comparison
plt.boxplot(results, labels=names, showmeans=True)
plt.ylabel('Cross validation score',fontsize=15)
plt.xlabel('Number of trees',fontsize=15)
>5 0.467 (0.015)
>10 0.509 (0.008)
>50 0.613 (0.016)
>100 0.639 (0.017)
>500 0.634 (0.019)
>1000 0.631 (0.017)
>2000 0.626 (0.016)
>5000 0.623 (0.015)
Text(0.5, 0, 'Number of trees')

# Finer grid search
def get_models():
    models = dict()
    # define number of trees to consider
    n_trees = [60,  80,  100, 150, 200]
    for n in n_trees:
        models[str(n)] = xgb.XGBClassifier(objective='multi:softmax', 
                          num_class=7,
                          eval_metric=f1_score,
                          n_estimators=n,
                          random_state = 1)
    return models

# evaluate a given model using cross-validation
def evaluate_model(model, X, y, sample_weights):
    # define the evaluation procedure
    cv = KFold(n_splits=5, shuffle=True, random_state=1)
    fit_params = {"sample_weight": sample_weights}
    # evaluate the model and collect the results
    scores = (cross_val_score(model, X, y, fit_params = fit_params, scoring= 'f1_weighted', cv=cv, n_jobs=-1))
    return scores
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
    # evaluate the model
    scores = evaluate_model(model, X_train, y_train, weight)
    # store the results
    results.append(scores)
    names.append(name)
    # summarize the performance along the way
    print('>%s %.3f (%.3f)' % (name, np.mean(scores), np.std(scores)))
# plot model performance for comparison
plt.boxplot(results, labels=names, showmeans=True)
plt.ylabel('Cross validation score',fontsize=15)
plt.xlabel('Number of trees',fontsize=15)
>60 0.620 (0.020)
>80 0.633 (0.020)
>100 0.639 (0.017)
>150 0.634 (0.016)
>200 0.635 (0.018)
Text(0.5, 0, 'Number of trees')

Based on the finer search, scores plateau beyond 100 trees; a value around [150] is carried forward.

3.4.2 2. Depth

# Coarse grid search

# get a list of models to evaluate
def get_models():
    models = dict()
    # explore depths from 3 to 13
    depth = [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
    for n in depth:        
        models[str(n)] = xgb.XGBClassifier(objective='multi:softmax', 
                          num_class=7,
                          eval_metric=f1_score,
                          max_depth=n,
                          random_state = 1)
    return models

# evaluate a given model using cross-validation
def evaluate_model(model, X, y, sample_weights):
    # define the evaluation procedure
    cv = KFold(n_splits=5, shuffle=True, random_state=1)
    fit_params = {"sample_weight": sample_weights}
    # evaluate the model and collect the results
    scores = (cross_val_score(model, X, y, fit_params = fit_params, scoring= 'f1_weighted', cv=cv, n_jobs=-1))
    return scores

# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
    # evaluate the model
    scores = evaluate_model(model, X_train, y_train, weight)
    # store the results
    results.append(scores)
    names.append(name)
    # summarize the performance along the way
    print('>%s %.3f (%.3f)' % (name, np.mean(scores), np.std(scores)))
# plot model performance for comparison
plt.boxplot(results, labels=names, showmeans=True)
plt.ylabel('Cross validation score',fontsize=15)
plt.xlabel('Depth',fontsize=15)
>3 0.546 (0.008)
>4 0.584 (0.005)
>5 0.617 (0.013)
>6 0.639 (0.017)
>7 0.636 (0.014)
>8 0.636 (0.018)
>9 0.634 (0.011)
>10 0.640 (0.017)
>11 0.643 (0.013)
>12 0.648 (0.016)
>13 0.636 (0.018)
Text(0.5, 0, 'Depth')

# Finer grid search

# get a list of models to evaluate
def get_models():
    models = dict()
    # explore depths from 6 to 10
    depth = [6, 7, 8, 9, 10]
    for n in depth:        
        models[str(n)] = xgb.XGBClassifier(objective='multi:softmax', 
                          num_class=7,
                          eval_metric=f1_score,
                          max_depth=n,
                          random_state = 1)
    return models

# evaluate a given model using cross-validation
def evaluate_model(model, X, y, sample_weights):
    # define the evaluation procedure
    cv = KFold(n_splits=5, shuffle=True, random_state=1)
    fit_params = {"sample_weight": sample_weights}
    # evaluate the model and collect the results
    scores = (cross_val_score(model, X, y, fit_params = fit_params, scoring= 'f1_weighted', cv=cv, n_jobs=-1))
    return scores

# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
    # evaluate the model
    scores = evaluate_model(model, X_train, y_train, weight)
    # store the results
    results.append(scores)
    names.append(name)
    # summarize the performance along the way
    print('>%s %.3f (%.3f)' % (name, np.mean(scores), np.std(scores)))
# plot model performance for comparison
plt.boxplot(results, labels=names, showmeans=True)
plt.ylabel('Cross validation score',fontsize=15)
plt.xlabel('Depth',fontsize=15)
>6 0.639 (0.017)
>7 0.636 (0.014)
>8 0.636 (0.018)
>9 0.634 (0.011)
>10 0.640 (0.017)
Text(0.5, 0, 'Depth')

The ideal range for the tree depth seems to be [7, 9, 10].

3.4.3 3. Learning Rate

# Coarse grid search

# get a list of models to evaluate
def get_models():
    models = dict()
    for i in [0.01,0.05,0.1,0.2,0.3,0.4,0.5,0.6,0.8,1.0]:
        key = '%.4f' % i
        models[key] = xgb.XGBClassifier(objective='multi:softmax', 
                          num_class=7,
                          eval_metric=f1_score,
                          learning_rate= i,
                          random_state = 1)
    return models

# evaluate a given model using cross-validation
def evaluate_model(model, X, y, sample_weights):
    # define the evaluation procedure
    cv = KFold(n_splits=5, shuffle=True, random_state=1)
    fit_params = {"sample_weight": sample_weights}
    # evaluate the model and collect the results
    scores = (cross_val_score(model, X, y, fit_params = fit_params, scoring= 'f1_weighted', cv=cv, n_jobs=-1))
    return scores

# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
    # evaluate the model
    scores = evaluate_model(model, X_train, y_train, weight)
    # store the results
    results.append(scores)
    names.append(name)
    # summarize the performance along the way
    print('>%s %.3f (%.3f)' % (name, np.mean(scores), np.std(scores)))
# plot model performance for comparison
plt.boxplot(results, labels=names, showmeans=True)
plt.ylabel('Cross validation score',fontsize=15)
plt.xlabel('Learning Rate',fontsize=15)
>0.0100 0.459 (0.019)
>0.0500 0.538 (0.018)
>0.1000 0.581 (0.016)
>0.2000 0.615 (0.012)
>0.3000 0.639 (0.017)
>0.4000 0.628 (0.017)
>0.5000 0.629 (0.017)
>0.6000 0.627 (0.011)
>0.8000 0.625 (0.014)
>1.0000 0.631 (0.014)
Text(0.5, 0, 'Learning Rate')

# Finer grid search

# get a list of models to evaluate
def get_models():
    models = dict()
    for i in [0.1, 0.15, 0.2, 0.25,0.3, 0.35, 0.4]:
        key = '%.4f' % i
        models[key] = xgb.XGBClassifier(objective='multi:softmax', 
                          num_class=7,
                          eval_metric=f1_score,
                          learning_rate= i,
                          random_state = 1)
    return models

# evaluate a given model using cross-validation
def evaluate_model(model, X, y, sample_weights):
    # define the evaluation procedure
    cv = KFold(n_splits=5, shuffle=True, random_state=1)
    fit_params = {"sample_weight": sample_weights}
    # evaluate the model and collect the results
    scores = (cross_val_score(model, X, y, fit_params = fit_params, scoring= 'f1_weighted', cv=cv, n_jobs=-1))
    return scores

# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
    # evaluate the model
    scores = evaluate_model(model, X_train, y_train, weight)
    # store the results
    results.append(scores)
    names.append(name)
    # summarize the performance along the way
    print('>%s %.3f (%.3f)' % (name, np.mean(scores), np.std(scores)))
# plot model performance for comparison
plt.boxplot(results, labels=names, showmeans=True)
plt.ylabel('Cross validation score',fontsize=15)
plt.xlabel('Learning Rate',fontsize=15)
>0.1000 0.581 (0.016)
>0.1500 0.604 (0.019)
>0.2000 0.615 (0.012)
>0.2500 0.621 (0.016)
>0.3000 0.639 (0.017)
>0.3500 0.625 (0.015)
>0.4000 0.628 (0.017)
Text(0.5, 0, 'Learning Rate')

Learning rates of [0.2, 0.3, 0.35] seem ideal.

3.4.4 4. Reg Lambda

# Coarse grid search
def get_models():
    models = dict()
    for i in [0,0.5,1.0,1.5,2,10,100, 500, 1000]:
        key = '%.2f' % i
        models[key] = xgb.XGBClassifier(objective='multi:softmax', 
                          num_class=7,
                          eval_metric=f1_score,
                          reg_lambda=i,
                          random_state=1)
    return models

# evaluate a given model using cross-validation
def evaluate_model(model, X, y, sample_weights):
    # define the evaluation procedure
    cv = KFold(n_splits=5, shuffle=True, random_state=1)
    fit_params = {"sample_weight": sample_weights}
    # evaluate the model and collect the results
    scores = (cross_val_score(model, X, y, fit_params = fit_params, scoring= 'f1_weighted', cv=cv, n_jobs=-1))
    return scores

# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
    # evaluate the model
    scores = evaluate_model(model, X_train, y_train, weight)
    # store the results
    results.append(scores)
    names.append(name)
    # summarize the performance along the way
    print('>%s %.3f (%.3f)' % (name, np.mean(scores), np.std(scores)))
# plot model performance for comparison
plt.boxplot(results, labels=names, showmeans=True)
plt.ylabel('Cross validation score',fontsize=15)
plt.xlabel('reg_lambda',fontsize=15)
>0.00 0.632 (0.018)
>0.50 0.627 (0.017)
>1.00 0.639 (0.017)
>1.50 0.626 (0.009)
>2.00 0.628 (0.024)
>10.00 0.615 (0.014)
>100.00 0.562 (0.011)
>500.00 0.500 (0.015)
>1000.00 0.472 (0.020)
Text(0.5, 0, 'reg_lambda')

# Finer grid search
def get_models():
    models = dict()
    for i in [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 5.0, 10]:
        key = '%.1f' % i
        models[key] = xgb.XGBClassifier(objective='multi:softmax', 
                          num_class=7,
                          eval_metric=f1_score,
                          reg_lambda=i,
                          random_state=1)
    return models

# evaluate a given model using cross-validation
def evaluate_model(model, X, y, sample_weights):
    # define the evaluation procedure
    cv = KFold(n_splits=5, shuffle=True, random_state=1)
    fit_params = {"sample_weight": sample_weights}
    # evaluate the model and collect the results
    scores = (cross_val_score(model, X, y, fit_params = fit_params, scoring= 'f1_weighted', cv=cv, n_jobs=-1))
    return scores

# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
    # evaluate the model
    scores = evaluate_model(model, X_train, y_train, weight)
    # store the results
    results.append(scores)
    names.append(name)
    # summarize the performance along the way
    print('>%s %.3f (%.3f)' % (name, np.mean(scores), np.std(scores)))
# plot model performance for comparison
plt.boxplot(results, labels=names, showmeans=True)
plt.ylabel('Cross validation score',fontsize=15)
plt.xlabel('reg_lambda',fontsize=15)
>0.5 0.627 (0.017)
>1.0 0.639 (0.017)
>1.5 0.626 (0.009)
>2.0 0.628 (0.024)
>2.5 0.626 (0.009)
>3.0 0.627 (0.010)
>5.0 0.625 (0.009)
>10.0 0.615 (0.014)
Text(0.5, 0, 'reg_lambda')

reg_lambda values of [1.0, 2.5, 3.0] seem ideal.

3.4.5 5. Gamma

# Coarse grid search
def get_models():
    models = dict()
    for i in [0,10,1e2,1e3,1e4,1e5,1e6,1e7,1e8,1e9]:
        key = '%.0f' % i
        models[key] = xgb.XGBClassifier(objective='multi:softmax', 
                          num_class=7,
                          eval_metric=f1_score,
                          gamma=i,
                          random_state=1)
    return models

# evaluate a given model using cross-validation
def evaluate_model(model, X, y, sample_weights):
    # define the evaluation procedure
    cv = KFold(n_splits=5, shuffle=True, random_state=1)
    fit_params = {"sample_weight": sample_weights}
    # evaluate the model and collect the results
    scores = (cross_val_score(model, X, y, fit_params = fit_params, scoring= 'f1_weighted', cv=cv, n_jobs=-1))
    return scores

# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
    # evaluate the model
    scores = evaluate_model(model, X_train, y_train, weight)
    # store the results
    results.append(scores)
    names.append(name)
    # summarize the performance along the way
    print('>%s %.3f (%.3f)' % (name, np.mean(scores), np.std(scores)))
# plot model performance for comparison
plt.boxplot(results, labels=names, showmeans=True)
plt.ylabel('Cross validation score',fontsize=15)
plt.xlabel('gamma',fontsize=15)
>0 0.639 (0.017)
>10 0.439 (0.030)
>100 0.285 (0.066)
>1000 0.010 (0.018)
>10000 0.010 (0.018)
>100000 0.010 (0.018)
>1000000 0.010 (0.018)
>10000000 0.010 (0.018)
>100000000 0.010 (0.018)
>1000000000 0.010 (0.018)

# Finer grid search
def get_models():
    models = dict()
    for i in [0, 1, 2, 3, 4, 5]:
        key = '%.1f' % i
        models[key] = xgb.XGBClassifier(objective='multi:softmax', 
                          num_class=7,
                          eval_metric=f1_score,
                          gamma=i,
                          random_state=1)
    return models

# evaluate a given model using cross-validation
def evaluate_model(model, X, y, sample_weights):
    # define the evaluation procedure
    cv = KFold(n_splits=5, shuffle=True, random_state=1)
    fit_params = {"sample_weight": sample_weights}
    # evaluate the model and collect the results
    scores = (cross_val_score(model, X, y, fit_params = fit_params, scoring= 'f1_weighted', cv=cv, n_jobs=-1))
    return scores

# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
    # evaluate the model
    scores = evaluate_model(model, X_train, y_train, weight)
    # store the results
    results.append(scores)
    names.append(name)
    # summarize the performance along the way
    print('>%s %.3f (%.3f)' % (name, np.mean(scores), np.std(scores)))
# plot model performance for comparison
plt.boxplot(results, labels=names, showmeans=True)
plt.ylabel('Cross validation score',fontsize=15)
plt.xlabel('gamma',fontsize=15)
>0.0 0.639 (0.017)
>1.0 0.567 (0.009)
>2.0 0.522 (0.015)
>3.0 0.500 (0.015)
>4.0 0.475 (0.014)
>5.0 0.466 (0.010)

# Finer grid search
def get_models():
    models = dict()
    for i in [0, 0.1, 0.2, 0.3, 0.4, 0.5]:
        key = '%.1f' % i
        models[key] = xgb.XGBClassifier(objective='multi:softmax', 
                          num_class=7,
                          eval_metric=f1_score,
                          gamma=i,
                          random_state=1)
    return models

# evaluate a given model using cross-validation
def evaluate_model(model, X, y, sample_weights):
    # define the evaluation procedure
    cv = KFold(n_splits=5, shuffle=True, random_state=1)
    fit_params = {"sample_weight": sample_weights}
    # evaluate the model and collect the results
    scores = (cross_val_score(model, X, y, fit_params = fit_params, scoring= 'f1_weighted', cv=cv, n_jobs=-1))
    return scores

# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
    # evaluate the model
    scores = evaluate_model(model, X_train, y_train, weight)
    # store the results
    results.append(scores)
    names.append(name)
    # summarize the performance along the way
    print('>%s %.3f (%.3f)' % (name, np.mean(scores), np.std(scores)))
# plot model performance for comparison
plt.boxplot(results, labels=names, showmeans=True)
plt.ylabel('Cross validation score',fontsize=15)
plt.xlabel('gamma',fontsize=15)
>0.0 0.639 (0.017)
>0.1 0.630 (0.014)
>0.2 0.629 (0.018)
>0.3 0.617 (0.017)
>0.4 0.606 (0.015)
>0.5 0.598 (0.009)

# Finer grid search
def get_models():
    models = dict()
    for i in [0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1]:
        key = '%.3f' % i
        models[key] = xgb.XGBClassifier(objective='multi:softmax', 
                          num_class=7,
                          eval_metric=f1_score,
                          gamma=i,
                          random_state=1)
    return models

# evaluate a given model using cross-validation
def evaluate_model(model, X, y, sample_weights):
    # define the evaluation procedure
    cv = KFold(n_splits=5, shuffle=True, random_state=1)
    fit_params = {"sample_weight": sample_weights}
    # evaluate the model and collect the results
    scores = (cross_val_score(model, X, y, fit_params = fit_params, scoring= 'f1_weighted', cv=cv, n_jobs=-1))
    return scores

# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
    # evaluate the model
    scores = evaluate_model(model, X_train, y_train, weight)
    # store the results
    results.append(scores)
    names.append(name)
    # summarize the performance along the way
    print('>%s %.3f (%.3f)' % (name, np.mean(scores), np.std(scores)))
# plot model performance for comparison
plt.boxplot(results, labels=names, showmeans=True)
plt.ylabel('Cross validation score',fontsize=15)
plt.xlabel('gamma',fontsize=15)
>0.000 0.639 (0.017)
>0.010 0.629 (0.022)
>0.020 0.634 (0.015)
>0.030 0.631 (0.014)
>0.040 0.624 (0.013)
>0.050 0.629 (0.017)
>0.060 0.630 (0.018)
>0.070 0.637 (0.011)
>0.080 0.630 (0.014)
>0.090 0.631 (0.020)
>0.100 0.630 (0.014)

Gamma values of 0.07 and 0.08 seem ideal; performance degrades steadily for larger values and collapses entirely once gamma reaches 10 or more.

3.4.6 6. Subsample

# Coarse grid search
def get_models():
    models = dict()
    sub =  [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
    for s in sub:
        key = '%.2f' % s
        models[key] = xgb.XGBClassifier(objective='multi:softmax', 
                          num_class=7,
                          eval_metric=f1_score,
                          subsample = s,
                          random_state=1)
    return models

# evaluate a given model using cross-validation
def evaluate_model(model, X, y, sample_weights):
    # define the evaluation procedure
    cv = KFold(n_splits=5, shuffle=True, random_state=1)
    fit_params = {"sample_weight": sample_weights}
    # evaluate the model and collect the results
    scores = (cross_val_score(model, X, y, fit_params = fit_params, scoring= 'f1_weighted', cv=cv, n_jobs=-1))
    return scores

# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
    # evaluate the model
    scores = evaluate_model(model, X_train, y_train, weight)
    # store the results
    results.append(scores)
    names.append(name)
    # summarize the performance along the way
    print('>%s %.3f (%.3f)' % (name, np.mean(scores), np.std(scores)))
# plot model performance for comparison
plt.boxplot(results, labels=names, showmeans=True)
plt.ylabel('Cross validation score',fontsize=15)
plt.xlabel('subsample',fontsize=15)
>0.10 0.555 (0.011)
>0.20 0.596 (0.022)
>0.30 0.609 (0.015)
>0.40 0.627 (0.009)
>0.50 0.623 (0.015)
>0.60 0.634 (0.019)
>0.70 0.636 (0.006)
>0.80 0.635 (0.013)
>0.90 0.635 (0.010)
>1.00 0.639 (0.017)

# Finer grid search
def get_models():
    models = dict()
    sub =  [0.7, 0.75, 0.8, 0.85, 0.9]
    for s in sub:
        key = '%.2f' % s
        models[key] = xgb.XGBClassifier(objective='multi:softmax', 
                          num_class=7,
                          eval_metric=f1_score,
                          subsample = s,
                          random_state=1)
    return models

# evaluate a given model using cross-validation
def evaluate_model(model, X, y, sample_weights):
    # define the evaluation procedure
    cv = KFold(n_splits=5, shuffle=True, random_state=1)
    fit_params = {"sample_weight": sample_weights}
    # evaluate the model and collect the results
    scores = (cross_val_score(model, X, y, fit_params = fit_params, scoring= 'f1_weighted', cv=cv, n_jobs=-1))
    return scores

# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
    # evaluate the model
    scores = evaluate_model(model, X_train, y_train, weight)
    # store the results
    results.append(scores)
    names.append(name)
    # summarize the performance along the way
    print('>%s %.3f (%.3f)' % (name, np.mean(scores), np.std(scores)))
# plot model performance for comparison
plt.boxplot(results, labels=names, showmeans=True)
plt.ylabel('Cross validation score',fontsize=15)
plt.xlabel('subsample',fontsize=15)
>0.70 0.636 (0.006)
>0.75 0.637 (0.013)
>0.80 0.635 (0.013)
>0.85 0.633 (0.020)
>0.90 0.635 (0.010)

The optimal subsample values appear to be 0.7, 0.85, and 0.9.
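The same coarse-to-fine, one-parameter-at-a-time search is repeated above for each hyperparameter; it could be factored into a single helper. A sketch (the `make_model` factory interface is an assumption, not code from the notebook; note that newer scikit-learn versions rename `cross_val_score`'s `fit_params` argument to `params`):

```python
from sklearn.model_selection import KFold, cross_val_score

def search_param(make_model, values, X, y, sample_weights=None, scoring='f1_weighted'):
    """Cross-validate one hyperparameter over a list of candidate values.

    make_model: callable mapping a candidate value to an unfitted estimator.
    Returns {value: (mean_score, std_score)}.
    """
    cv = KFold(n_splits=5, shuffle=True, random_state=1)
    # only pass fit_params when weights are supplied (argument renamed in newer sklearn)
    extra = {} if sample_weights is None else {'fit_params': {'sample_weight': sample_weights}}
    results = {}
    for v in values:
        scores = cross_val_score(make_model(v), X, y, cv=cv, scoring=scoring,
                                 n_jobs=-1, **extra)
        results[v] = (scores.mean(), scores.std())
    return results
```

In the notebook this would be called with, e.g., `lambda g: xgb.XGBClassifier(objective='multi:softmax', num_class=7, gamma=g, random_state=1)` together with `X_train`, `y_train`, and `weight`.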

param_grid = {'n_estimators':[80],
              'max_depth': [8, 9],
              'learning_rate': [0.2, 0.3, 0.35],
              'gamma': [0.07, 0.08],
              'reg_lambda':[1.0, 2.5, 3.0],
              'subsample': [0.7, 0.85, 0.9]}

cv = StratifiedKFold(n_splits=5,shuffle=True,random_state=1)
optimal_params = GridSearchCV(estimator=xgb.XGBClassifier(objective='multi:softmax', 
                          num_class=7,
                          random_state=1),
                          param_grid = param_grid,
                          scoring = 'f1_weighted',
                          verbose = 1,
                          n_jobs= -1,
                          cv = cv)

optimal_params.fit(X_train, y_train, sample_weight = weight)
print(optimal_params.best_params_,optimal_params.best_score_)
Fitting 5 folds for each of 108 candidates, totalling 540 fits
{'gamma': 0.07, 'learning_rate': 0.2, 'max_depth': 9, 'n_estimators': 80, 'reg_lambda': 1.0, 'subsample': 0.7} 0.648089065194428

Further tuning was done to obtain the optimal hyperparameters:

xgb_model = xgb.XGBClassifier(objective = 'multi:softmax', random_state = 1, gamma = 0.06, learning_rate = 0.3, max_depth = 9,
                          n_estimators = 150, reg_lambda = 1.0, subsample = 0.8, num_class= 7, eval_metric=f1_score)
xgb_model.fit(X_train, y_train, sample_weight = weight)
y_pred = xgb_model.predict(X_test)

# Print the precision and recall, among other metrics
print(classification_report(y_test, y_pred, digits=3))

# Print the confusion matrix
print(confusion_matrix(y_test, y_pred))

# Metrics on Test Data
print("Accuracy on Test Data: ", accuracy_score(y_test, y_pred))
print("Precision on Test Data: ", precision_score(y_test, y_pred, average='weighted'))
print("Recall on Test Data: ", recall_score(y_test, y_pred, average='weighted'))
print("F1-score on Test Data: ", f1_score(y_test, y_pred, average='weighted'))
              precision    recall  f1-score   support

           0      0.000     0.000     0.000         4
           1      0.444     0.364     0.400        33
           2      0.684     0.646     0.664       291
           3      0.664     0.723     0.692       440
           4      0.600     0.580     0.590       176
           5      0.741     0.571     0.645        35
           6      0.000     0.000     0.000         1

    accuracy                          0.653       980
   macro avg      0.448     0.412     0.427       980
weighted avg      0.650     0.653     0.650       980

[[  0   1   1   2   0   0   0]
 [  0  12  14   6   1   0   0]
 [  1   8 188  85   9   0   0]
 [  0   5  65 318  49   3   0]
 [  0   1   6  62 102   4   1]
 [  0   0   1   6   8  20   0]
 [  0   0   0   0   1   0   0]]
Accuracy on Test Data:  0.6530612244897959
Precision on Test Data:  0.6502442182752354
Recall on Test Data:  0.6530612244897959
F1-score on Test Data:  0.650376342395648

After tuning and refitting with the adjusted sample weights, performance improved to 65.3% accuracy (weighted F1: 65.0%), up from 60.6% before the sample-weight adjustment.
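The sample-weight vector `weight` is defined earlier in the notebook; one common way to obtain such weights (an assumption here, not necessarily how `weight` was built) is scikit-learn's balanced weighting, which up-weights rare quality classes:

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# toy label vector standing in for y_train; rare classes receive larger weights
y_demo = np.array([0, 0, 0, 1, 1, 2])
w = compute_sample_weight(class_weight='balanced', y=y_demo)
# each weight equals n_samples / (n_classes * class_count)
```

For the wine data, this would be `compute_sample_weight(class_weight='balanced', y=y_train)`, passed to `fit` via `sample_weight`.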

pd.concat([pd.Series(X_train.columns, name = 'predictor'), 
           pd.Series(xgb_model.feature_importances_, 
                     name = 'importance')], axis = 1).sort_values(by = 'importance', ascending=False)
               predictor  importance
10               alcohol    0.154100
0          fixed acidity    0.138364
5    free sulfur dioxide    0.102930
4              chlorides    0.099804
8                     pH    0.087954
1       volatile acidity    0.083195
7                density    0.077974
3         residual sugar    0.076811
6   total sulfur dioxide    0.073330
2            citric acid    0.058400
9              sulphates    0.047138

Alcohol is the most important predictor for the XGBoost model.
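The importance ranking above can also be visualized as a horizontal bar chart; a sketch using the values reported by the fitted model (the non-interactive matplotlib backend is set so this runs in scripts as well as notebooks):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; safe outside notebooks
import matplotlib.pyplot as plt
import pandas as pd

# feature importances reported by the fitted XGBoost model above
imp = pd.Series({
    'alcohol': 0.154100, 'fixed acidity': 0.138364, 'free sulfur dioxide': 0.102930,
    'chlorides': 0.099804, 'pH': 0.087954, 'volatile acidity': 0.083195,
    'density': 0.077974, 'residual sugar': 0.076811, 'total sulfur dioxide': 0.073330,
    'citric acid': 0.058400, 'sulphates': 0.047138,
}).sort_values()

imp.plot.barh()
plt.xlabel('importance')
plt.tight_layout()
```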

4 Model Ensemble


4.1 Voting ensemble - Hard Voting

By Aarti Pappu

Ensembling all 4 developed models:

#Voting ensemble: Averaging the predictions of all models
voting_ensemble = VotingClassifier(estimators = [('dt',dt_model),('bg',bg_model),('rf',rf_model),('xgb',xgb_model)])
voting_ensemble.fit(X_train,y_train)
voting_pred = voting_ensemble.predict(X_test)
print(classification_report(y_test, voting_pred, digits=3))
print(confusion_matrix(y_test, voting_pred))
              precision    recall  f1-score   support

           3      0.000     0.000     0.000         4
           4      0.615     0.242     0.348        33
           5      0.684     0.677     0.680       291
           6      0.634     0.789     0.703       440
           7      0.717     0.460     0.561       176
           8      1.000     0.543     0.704        35
           9      0.000     0.000     0.000         1

    accuracy                          0.665       980
   macro avg      0.522     0.387     0.428       980
weighted avg      0.673     0.665     0.655       980

[[  0   0   1   3   0   0   0]
 [  0   8  17   8   0   0   0]
 [  0   3 197  91   0   0   0]
 [  0   1  68 347  24   0   0]
 [  0   1   4  90  81   0   0]
 [  0   0   1   8   7  19   0]
 [  0   0   0   0   1   0   0]]

A logistic model was added to see if it would decorrelate the predictions (since our models are all tree-based) and increase the precision, recall, and/or accuracy.

log_model = LogisticRegression(solver = 'newton-cg', max_iter = 1000).fit(X_train, y_train)

#Voting ensemble: Averaging the predictions of all models
voting_ensemble = VotingClassifier(estimators = [('dt',dt_model),('bg',bg_model),('rf',rf_model),('xgb',xgb_model),('log',log_model)])
voting_ensemble.fit(X_train,y_train)
voting_pred = voting_ensemble.predict(X_test)
print(classification_report(y_test, voting_pred, digits=3))
print(confusion_matrix(y_test, voting_pred))
              precision    recall  f1-score   support

           3      0.000     0.000     0.000         4
           4      0.615     0.242     0.348        33
           5      0.701     0.653     0.676       291
           6      0.641     0.811     0.716       440
           7      0.708     0.483     0.574       176
           8      1.000     0.543     0.704        35
           9      0.000     0.000     0.000         1

    accuracy                          0.672       980
   macro avg      0.524     0.390     0.431       980
weighted avg      0.680     0.672     0.662       980

[[  0   0   1   3   0   0   0]
 [  0   8  17   8   0   0   0]
 [  0   4 190  97   0   0   0]
 [  0   1  56 357  26   0   0]
 [  0   0   6  85  85   0   0]
 [  0   0   1   7   8  19   0]
 [  0   0   0   0   1   0   0]]

The voting ensemble improved slightly with the addition of the logistic regression model: weighted precision rose from 0.673 to 0.680 and accuracy from 66.5% to 67.2%.
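The ensembles above use hard (majority) voting, the `VotingClassifier` default. Soft voting instead averages each model's predicted class probabilities, which can help when the base models are well calibrated. A minimal sketch on synthetic data (the estimators and data here are illustrative, not the notebook's fitted models):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_classes=3, n_informative=6, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# voting='soft' averages predict_proba outputs; every estimator must support predict_proba
soft_ensemble = VotingClassifier(
    estimators=[('rf', RandomForestClassifier(random_state=1)),
                ('log', LogisticRegression(max_iter=1000))],
    voting='soft')
soft_ensemble.fit(X_tr, y_tr)
acc = soft_ensemble.score(X_te, y_te)
```

All four of our tree-based models expose `predict_proba`, so the same change (`voting='soft'`) could be applied to the ensemble above.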

4.2 Stacking ensemble(s)

By Aarti Pappu

#Using Logistic regression as the meta model (final_estimator)
stacking_ensemble = StackingClassifier(estimators=[('dt',dt_model),('bg',bg_model),('rf',rf_model),('xgb',xgb_model)],
                                   final_estimator=LogisticRegression(random_state=1,max_iter=10000),n_jobs=-1,
                                   cv = StratifiedKFold(n_splits=2,shuffle=True,random_state=1))
stacking_ensemble.fit(X_train,y_train)
stacking_pred = stacking_ensemble.predict(X_test)
print(classification_report(y_test, stacking_pred, digits=3))
print(confusion_matrix(y_test, stacking_pred))
              precision    recall  f1-score   support

           3      0.000     0.000     0.000         4
           4      0.688     0.333     0.449        33
           5      0.706     0.653     0.679       291
           6      0.637     0.789     0.705       440
           7      0.687     0.511     0.586       176
           8      1.000     0.543     0.704        35
           9      0.000     0.000     0.000         1

    accuracy                          0.670       980
   macro avg      0.531     0.404     0.446       980
weighted avg      0.678     0.670     0.663       980

[[  0   0   2   2   0   0   0]
 [  0  11  15   7   0   0   0]
 [  0   3 190  98   0   0   0]
 [  0   2  59 347  32   0   0]
 [  0   0   3  83  90   0   0]
 [  0   0   0   8   8  19   0]
 [  0   0   0   0   1   0   0]]

As with the voting ensemble, a logistic model was added to see whether it would decorrelate the tree-based predictions and increase the precision, recall, and/or accuracy.

#Using Logistic regression as the meta model (final_estimator)
stacking_ensemble = StackingClassifier(estimators=[('dt',dt_model),('bg',bg_model),('rf',rf_model),('xgb',xgb_model), ('log', log_model)],
                                   final_estimator=LogisticRegression(random_state=1,max_iter=10000),n_jobs=-1,
                                   cv = StratifiedKFold(n_splits=2,shuffle=True,random_state=1))
stacking_ensemble.fit(X_train,y_train)
stacking_pred = stacking_ensemble.predict(X_test)
print(classification_report(y_test, stacking_pred, digits=3))
print(confusion_matrix(y_test, stacking_pred))
              precision    recall  f1-score   support

           3      0.000     0.000     0.000         4
           4      0.647     0.333     0.440        33
           5      0.703     0.660     0.681       291
           6      0.636     0.780     0.701       440
           7      0.682     0.511     0.584       176
           8      1.000     0.543     0.704        35
           9      0.000     0.000     0.000         1

    accuracy                          0.668       980
   macro avg      0.524     0.404     0.444       980
weighted avg      0.675     0.668     0.662       980

[[  0   0   2   2   0   0   0]
 [  0  11  15   7   0   0   0]
 [  0   4 192  95   0   0   0]
 [  0   2  61 343  34   0   0]
 [  0   0   3  83  90   0   0]
 [  0   0   0   9   7  19   0]
 [  0   0   0   0   1   0   0]]

The overall performance of the stacked ensemble model was worse with the logistic regression model included, so it was excluded from the final stacked ensemble model:

# Final stacking ensemble model
stacking_ensemble = StackingClassifier(estimators=[('dt',dt_model),('bg',bg_model),('rf',rf_model),('xgb',xgb_model)],
                                   final_estimator=LogisticRegression(random_state=1,max_iter=10000),n_jobs=-1,
                                   cv = StratifiedKFold(n_splits=2,shuffle=True,random_state=1))
stacking_ensemble.fit(X_train,y_train)
stacking_pred = stacking_ensemble.predict(X_test)

4.3 Ensemble of ensembled models (using stacking ensemble)

By Aarti Pappu

#Using Logistic regression as the meta model (final_estimator)
ensemble_model = StackingClassifier(estimators=[('voting',voting_ensemble),('stacking',stacking_ensemble)],
                                   final_estimator=LogisticRegression(random_state=1,max_iter=10000),n_jobs=-1,
                                   cv = StratifiedKFold(n_splits=2,shuffle=True,random_state=1))
ensemble_model.fit(X_train,y_train)
ensemble_pred = ensemble_model.predict(X_test)
print(classification_report(y_test, ensemble_pred, digits=3))
print(confusion_matrix(y_test, ensemble_pred))
              precision    recall  f1-score   support

           3      0.000     0.000     0.000         4
           4      0.692     0.273     0.391        33
           5      0.703     0.674     0.688       291
           6      0.646     0.789     0.710       440
           7      0.689     0.517     0.591       176
           8      1.000     0.543     0.704        35
           9      0.000     0.000     0.000         1

    accuracy                          0.676       980
   macro avg      0.533     0.399     0.441       980
weighted avg      0.682     0.676     0.668       980

[[  0   0   2   2   0   0   0]
 [  0   9  17   7   0   0   0]
 [  0   3 196  91   1   0   0]
 [  0   1  61 347  31   0   0]
 [  0   0   3  82  91   0   0]
 [  0   0   0   8   8  19   0]
 [  0   0   0   0   1   0   0]]